Abstract — Uncoordinated checkpointing allows process autonomy
and general nondeterministic execution, but suffers from potential
domino effects and the associated space overhead. Previous to this
research, checkpoint space reclamation had been based on the notion
of obsolete checkpoints; as a result, a potentially unbounded number
of nonobsolete checkpoints may have to be retained on stable storage.
In this paper, we derive a necessary and sufficient condition for
identifying all garbage checkpoints. By using the approach of
recovery line transformation and decomposition, we develop an optimal
checkpoint space reclamation algorithm and show that the space
overhead for uncoordinated checkpointing is in fact bounded by
$N(N+1)/2$ checkpoints, where $N$ is the number of processes.

Keywords — fault tolerance, message-passing systems, uncoordinated
checkpointing, rollback recovery, garbage collection.

I. INTRODUCTION

Checkpointing and rollback recovery is an effective approach to
tolerating both hardware and software faults. During normal execution,
the state of each process is periodically saved on stable storage as a
checkpoint. When a failure occurs, the process can roll back to a
previous checkpoint by reloading the checkpointed state. In a
message-passing system, rollback propagation can occur when the
rollback of a message sender results in the rollback of the
corresponding receiver. The system is then required to roll back to
the latest available consistent set of checkpoints, called the
recovery line, to ensure correct recovery with a minimum amount of
rollback. In the worst case, cascading rollback propagation may result
in the domino effect [1], which prevents recovery line progression.

Numerous checkpointing and rollback recovery techniques have been
proposed in the literature for message-passing systems. Uncoordinated
checkpointing [2-4] allows maximum process autonomy and general
nondeterministic execution. Each process takes its checkpoints
independently and keeps track of the dependencies among checkpoints
resulting from message communications. When a failure occurs, the
dependency information is used to determine the recovery line to which
the system should roll back. The major disadvantages of uncoordinated
checkpointing have been the potential domino effect and the space
overhead required for maintaining multiple checkpoints of each
process.

This paper addresses the second disadvantage by developing an optimal
checkpoint space reclamation algorithm to minimize the space overhead.
Several techniques have been proposed to address the first issue, to
guarantee recovery line progression. Coordinated checkpointing [5,6]
eliminates the domino effect by sacrificing a certain degree of
process autonomy. Extra coordination messages are required to enforce
the consistency between the checkpoints belonging to the same
checkpointing session. The run-time overhead can be reduced if certain
optimization techniques can be employed [7]. For applications which
require process autonomy in taking checkpoints in order to exploit
application-dependent information to checkpoint at the “right time”,
e.g., when the process state is minimal, lazy checkpoint coordination
[8] can be incorporated into an uncoordinated checkpointing protocol
to provide a trade-off between coordination overhead and recovery
efficiency.

Another approach to eliminating the domino effect is to exploit the
piecewise deterministic execution model [9-11], in which each process
execution is viewed as a number of deterministic state intervals
bounded by nondeterministic events. It has been shown [12] that by
considering each nondeterministic event log as a logical checkpoint
[13] taken at the end of the ensuing state interval, the same
dependency model and hence the checkpoint space reclamation algorithm
developed in this paper can still be applied.

Traditionally, checkpoint space reclamation for uncoordinated
checkpointing has been based on the notion of obsolete checkpoints:
the global recovery line which suffices to recover from the failure of
the entire system is computed; then all of the obsolete checkpoints
before that recovery line are no longer useful and can be discarded.
In contrast, all of the nonobsolete checkpoints have been assumed to
be possibly useful for some future recovery and should be retained.
With the possibility of domino effects, the number of nonobsolete
checkpoints is potentially unbounded.

Motivated by the observation that being obsolete is simply a
sufficient condition for being garbage, we derive a necessary and
sufficient condition for identifying all garbage checkpoints, which
leads to an optimal checkpoint space reclamation algorithm and the
least upper bound on the number of nongarbage checkpoints. Our
approach is to model consistent global checkpoints as maximum-sized
antichains of the partially ordered set generated by the happened
before relation between the checkpoints. We define a recovery line
transformation and decomposition, and demonstrate that any nongarbage
checkpoint belonging to a possible future recovery line must also be
contained in one of the $N$ “immediate future” recovery lines, where
$N$ is the number of processes. It is also shown that these $N$
recovery lines can contain at most $N(N+1)/2$ distinct nongarbage
checkpoints.

The outline of the paper is as follows. Section II describes the
checkpointing and recovery protocol and a model of consistent global
checkpoints; Section III derives a necessary and sufficient condition
for identifying all nongarbage checkpoints and presents the optimal
checkpoint space reclamation algorithm; the least upper bound on the
number of nongarbage checkpoints is derived in Section IV and
experimental evaluation is described in Section V. Due to space
limitation, some proofs are omitted and can be found in the complete
technical report [14].

II. CHECKPOINTING AND ROLLBACK RECOVERY
A.  System  Model  and  Recovery  Protocol 

The system considered in this paper consists of a number of concurrent
processes for which all process communication is through message
passing. Processes are assumed to run on fail-stop processors [15]
and, for the purpose of presentation, each process is considered as an
individual recovery unit. In order to allow general nondeterministic
execution, we do not assume the piecewise deterministic model. This
implies that whenever the sender of a message $m$ rolls back and
unsends $m$, the receiver which has already processed $m$ must also
roll back to undo the effect of $m$, because the potential
nondeterminism preceding the sending of $m$ may prevent the same
message from being resent during reexecution. Let $c_{i,x}$ denote the
$x$th checkpoint ($x \ge 0$) of process $p_i$ ($0 \le i \le N-1$),
where $N$ is the number of processes in the system. Two checkpoints
$c_{i,x}$ and $c_{j,y}$ are then considered inconsistent if there is
any message sent after $c_{j,y}$ and processed before $c_{i,x}$, or
vice versa. In contrast, when the receiver of a message $m'$ rolls
back and unreceives $m'$, the sender need not roll back to unsend $m'$
if $m'$ can be retrieved from a synchronous¹ message log [16,17] or
through a reliable end-to-end transmission protocol [6].

During normal execution, each process takes its local checkpoints
periodically without coordinating with any other processes. Let
$(i, x)$ denote the $x$th checkpoint interval of process $p_i$ between
consecutive checkpoints $c_{i,x}$ and $c_{i,x+1}$. Each message is
tagged with the current checkpoint interval number and the process
number of the sender. Each receiver $p_i$ performs direct dependency
tracking [2,18] as follows: if a message sent from $(j, y)$ is
processed in $(i, x)$, then the direct dependency of $c_{i,x+1}$ on
$c_{j,y}$ is recorded.
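
The tagging and recording rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol's actual implementation: the `Process` class, its method names, and the in-memory `deps` set are hypothetical, and stable storage and message transport are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model; all names here are illustrative."""
    pid: int
    interval: int = 0  # current checkpoint interval x (the one after c_{pid,x})
    deps: set = field(default_factory=set)  # recorded edges (c_{j,y}, c_{i,x+1})

    def send(self, payload):
        # Tag each message with the sender's process number and interval number.
        return (self.pid, self.interval, payload)

    def receive(self, message):
        j, y, _payload = message
        # A message sent from (j, y) and processed in (i, x) makes c_{i,x+1}
        # directly dependent on c_{j,y}.
        self.deps.add(((j, y), (self.pid, self.interval + 1)))

    def take_checkpoint(self):
        # Taking c_{pid,x+1} closes interval x and opens interval x+1.
        self.interval += 1


p0, p1 = Process(0), Process(1)
p0.receive(p1.send("m"))    # sent from (1, 0), processed in (0, 0)
p0.take_checkpoint()
p1.receive(p0.send("m2"))   # sent from (0, 1), processed in (1, 0)
print(sorted(p0.deps | p1.deps))
```

Each recorded pair is one edge of the checkpoint graph; the implicit edges between consecutive checkpoints of the same process are not stored here and would be added when the graph is built.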

A garbage collection procedure can be periodically invoked by any
process $p_i$ to reclaim the storage space of garbage checkpoints.
First, $p_i$ collects the direct dependency information from all the
other processes to construct the checkpoint graph [2] in which each
vertex represents a checkpoint and each edge represents a direct
dependency (including the implicit dependency of any $c_{i,y+1}$ on
$c_{i,y}$), as shown in Fig. 1(b). Then the rollback propagation
algorithm listed in Fig. 2 is executed on the checkpoint graph to
determine the global recovery line² (black vertices), before which all
the checkpoints are obsolete (marked “X”) and can be discarded.

Fig. 1. Checkpointing and rollback recovery. (a) Example checkpoint
and communication pattern; (b) checkpoint graph and extended
checkpoint graph when $p_0$ initiates a rollback.

Fig.  2.  The  rollback  propagation  algorithm. 
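
The listing of Fig. 2 is not included in this text, so the sketch below is an assumption about one common formulation of rollback propagation rather than the authors' exact procedure: start from the last checkpoint of each process, mark every checkpoint strictly reachable from the current root set, and replace each marked root by the last unmarked checkpoint of the same process until no root is marked. The function names and the `(process, index)` encoding of checkpoints are hypothetical.

```python
def strictly_reachable(succ, roots):
    """Vertices reachable from any root by a path of length >= 1."""
    seen, stack = set(), [w for r in roots for w in succ.get(r, ())]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succ.get(v, ()))
    return seen


def recovery_line(last_index, deps):
    """last_index[i] is the index of the last checkpoint of p_i;
    deps is a set of cross-process edges ((j, y), (i, x))."""
    succ = {}
    for i, last in last_index.items():        # implicit edges c_{i,x} -> c_{i,x+1}
        for x in range(last):
            succ.setdefault((i, x), set()).add((i, x + 1))
    for u, v in deps:                          # recorded direct dependencies
        succ.setdefault(u, set()).add(v)

    root = dict(last_index)                    # start from the last checkpoints
    marked = set()
    while True:
        marked |= strictly_reachable(succ, root.items())
        rolled = [i for i, x in root.items() if (i, x) in marked]
        if not rolled:
            return set(root.items())
        for i in rolled:                       # last unmarked checkpoint of p_i
            root[i] = max(x for x in range(last_index[i] + 1)
                          if (i, x) not in marked)


# A two-process zig-zag pattern (an illustrative domino effect, not Fig. 1).
domino = {((1, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (0, 2)), ((0, 2), (1, 2))}
print(sorted(recovery_line({0: 2, 1: 2}, domino)))
```

On this zig-zag pattern the root set is pushed all the way back to the two initial checkpoints, which is the domino effect described in Section I. The initial checkpoints have no incoming edges, so they are never marked and the loop terminates.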

When any process initiates a rollback, it starts a similar procedure
for recovery. The current volatile states of the surviving processes
are treated as additional virtual checkpoints [3] for constructing an
extended checkpoint graph of which the recovery line is called the
local recovery line (shaded vertices) and indicates the consistent
rollback state.

B.  A  Model  of  Consistent  Global  Checkpoints 

In a message-passing system, event $e_i$ directly happened before
event $e_j$ [19] if

• $e_i$ and $e_j$ are events in the same process and $e_i$ occurs
immediately before $e_j$; or

• $e_i$ is the sending of a message $m$ and $e_j$ is the receiving of
$m$.

The transitive closure of the direct happened before relation is the
happened before relation, denoted by $\prec$. The set of events with
the happened before relation forms a partially ordered set, or poset
[19]. For our purpose, we consider only the induced subposet
$R = (C, \prec)$, where $C$ is the set of all checkpoints.

¹ Extension of our work to an asynchronous logging protocol is
considered elsewhere [4].

² The global recovery line is to be used when the entire system fails,
while a local recovery line is computed when only a subset of
processes fail.

For a system with $N$ processes, a global checkpoint is defined as a
set of $N$ local checkpoints, one from each process. Based on the
earlier description of consistency, a consistent global checkpoint is
a global checkpoint of which no two constituent checkpoints are
ordered by the happened before relation. For the purpose of recovery,
we are interested in finding the latest available consistent global
checkpoint, referred to as the recovery line, which minimizes the
total rollback distance.

Our approach is based on the maximum-sized antichain model for
consistent global checkpoints [18]. Given a poset $P = (S, \prec)$, an
antichain is a subset $A$ of $S$ such that $x \not\prec y$ for any
$x, y \in A$. Intuitively, a consistent global checkpoint corresponds
to an antichain of the poset $R = (C, \prec)$. Since the initial
checkpoints of all processes must form an antichain of size $N$ and no
antichain can contain two checkpoints from the same process, the
largest size of any antichain in $R$ is exactly $N$. The following
lemma summarizes the main results described by Wang et al. [18].

Lemma 1: Given the poset $R = (C, \prec)$ of checkpoints generated by
the happened before relation and $M, M_1, M_2 \subseteq C$, let
$\mathcal{M}(R)$ denote the set of maximum-sized antichains of $R$.

(a) $M$ is a consistent global checkpoint if and only if
$M \in \mathcal{M}(R)$.

(b) Let $M[i]$ denote the constituent checkpoint of $M$ which is a
checkpoint of $p_i$ and, for any $M_1, M_2 \in \mathcal{M}(R)$, define
$M_1 \preceq M_2$ if $M_1[i] \preceq M_2[i]$ for all
$0 \le i \le N-1$. Then, the poset $(\mathcal{M}(R), \preceq)$ forms a
lattice.

(c) The recovery line is the unique maximal maximum-sized antichain,
denoted by $M^*(R)$, on the lattice $(\mathcal{M}(R), \preceq)$.

In this paper, we will use the notation $\mathcal{M}(G)$ to represent
the set of maximum-sized antichains of the poset corresponding to the
transitive closure of the checkpoint graph $G$.³ The notation $M^*(G)$
is similarly defined.
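
Lemma 1(a) suggests a direct test for consistency: a candidate set is a consistent global checkpoint exactly when it has one checkpoint per process and no path in the checkpoint graph connects two of its members. The sketch below assumes a hypothetical adjacency-map encoding `succ` (vertex to set of successors) with checkpoints written as `(process, index)` pairs; the function names are illustrative.

```python
def happened_before(succ, u, v):
    """True if the checkpoint graph has a path of length >= 1 from u to v."""
    stack, seen = list(succ.get(u, ())), set()
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(succ.get(w, ()))
    return False


def is_consistent_global_checkpoint(succ, candidate, n_processes):
    """candidate is a set of (process, index) pairs."""
    one_per_process = (len(candidate) == n_processes and
                       {p for p, _ in candidate} == set(range(n_processes)))
    antichain = all(not happened_before(succ, u, v)
                    for u in candidate for v in candidate if u != v)
    return one_per_process and antichain


# Two processes; p_1 sent a message in interval (1, 0) that p_0 processed
# in interval (0, 0), giving the edge c_{1,0} -> c_{0,1}.
succ = {(0, 0): {(0, 1)}, (1, 0): {(1, 1), (0, 1)}}
print(is_consistent_global_checkpoint(succ, {(0, 1), (1, 1)}, 2))
print(is_consistent_global_checkpoint(succ, {(0, 1), (1, 0)}, 2))
```

The quadratic pairwise check is chosen for clarity only; it is not meant as an efficient way to find the recovery line.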

III.  Optimal  Checkpoint  Space  Reclamation 
A.  Motivation  and  Problem  Formulation 

Since a future program execution may contain arbitrary checkpoint
dependencies and rollbacks, we first describe an execution model to
make the problem tractable. An operational session [3] is the interval
between the start of normal execution and the instance of rollback
initiation, as shown in Fig. 3. A recovery session immediately follows
the previous operational session and ends at the resumption of normal
execution. A program execution can be viewed as consisting of a number
of alternating operational sessions and recovery sessions. In terms of
the effect on the checkpoint graphs, new vertices are added as new
checkpoints are taken during an operational session, and existing
vertices can be deleted as some checkpoints are invalidated by the
rollback during a recovery session.

Fig.  3.  Operational  sessions,  recovery  sessions  and  nongarbage 
checkpoints. 

Since the purpose of maintaining checkpoints is for possible future
recovery, a checkpoint is garbage if and only if it can not belong to
any future recovery line. Being obsolete, i.e., before the global
recovery line, is simply a sufficient condition for being garbage, but
not a necessary condition. We first give an example of nonobsolete
garbage checkpoints. Figure 4 is a typical example illustrating the
domino effect. The global recovery line stays at the set of initial
checkpoints and is unable to move forward. The edge from $c_{0,2}$ to
$c_{1,2}$ and the one from $c_{1,1}$ to $c_{0,2}$ imply that $c_{0,2}$
is inconsistent with any checkpoint of process $p_1$. Since a recovery
line must contain one checkpoint from each process, $c_{0,2}$ can not
belong to any future recovery line⁴ and is therefore a garbage
checkpoint. Checkpoints $c_{1,1}$ and $c_{0,1}$ are garbage by similar
arguments.

³ It has been pointed out [18] that the poset $R'$ corresponding to
the transitive closure of the checkpoint graph based on direct
dependency tracking is not exactly the same as $R$. However, $R'$ and
$R$ possess the same set of maximum-sized antichains.

Fig.  4.  Example  of  nonobsolete  garbage  checkpoints. 

Figure 4 in fact provides another sufficient condition for identifying
garbage checkpoints; our optimal garbage collection aims at deriving
the necessary and sufficient condition. The difficulty of the problem
lies in the fact that future process execution may contain any number
of operational sessions (with arbitrary checkpoint dependencies) and
recovery sessions (with arbitrary subsets of processes being faulty).
We outline our approach as follows. Instead of trying to find garbage
checkpoints, we start with identifying nongarbage checkpoints. Given
any possible future recovery line which contains some nongarbage
checkpoints, for example, the recovery line shown in Fig. 3, we
perform recovery line transformation to transform it into another
recovery line which also contains those nongarbage checkpoints. We
show that all possible future recovery lines containing any nongarbage
checkpoint can be transformed into a set of $2^N$ “immediate future”
recovery lines. (Recall that $N$ is the number of processes.) Our next
step is recovery line decomposition. We identify a set of $N$ recovery
lines which forms the “basis” for those $2^N$ recovery lines and
therefore contains all of the nongarbage checkpoints.

B.  Recovery  Line  Transformation 

Our approach to transforming an arbitrary future recovery line
backwards in time is to first define two elementary transformations:
transformation within an operational session and transformation across
a recovery session. Any transformation can then be achieved through a
combination of these two elementary transformations.

Transformation  within  an  operational  session 

During normal process execution, the size of the checkpoint graph
increases as new checkpoints are taken. Because checkpoint graphs
represent program dependencies and are not arbitrary directed acyclic
graphs, the following rules must be satisfied when adding new
vertices. For every new vertex $c_{i,x}$ with $x \ge 1$,

Rule 1.a: $c_{i,x}$ must have an incoming edge from $c_{i,x-1}$;

Rule 1.b: $c_{i,x}$ can not have any outgoing edge to any existing
vertices because it can not happen before a checkpoint that was taken
earlier.

We use $\mathcal{G}_t(G)$ to denote the set of all potential
supergraphs obtainable by adjoining new vertices to a given checkpoint
graph $G$ without violating Rule 1.a and Rule 1.b.
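
Rules 1.a and 1.b can be checked mechanically when new vertices are adjoined. The sketch below is illustrative only: the function name, the `(process, index)` vertex encoding and the `(source, target)` edge encoding are assumptions made for this example.

```python
def rule_violations(old_vertices, new_vertices, new_edges):
    """Return the rule violations caused by adjoining new_vertices together
    with new_edges to a graph whose vertex set is old_vertices."""
    problems = []
    edges = set(new_edges)
    for (i, x) in sorted(new_vertices):
        # Rule 1.a: c_{i,x} needs an incoming edge from c_{i,x-1}.
        if x >= 1 and ((i, x - 1), (i, x)) not in edges:
            problems.append(("1.a", (i, x)))
    for (u, v) in sorted(edges):
        # Rule 1.b: a new vertex can not point to an existing vertex.
        if u in new_vertices and v in old_vertices:
            problems.append(("1.b", (u, v)))
    return problems


old = {(0, 0), (1, 0)}
ok_edges = {((0, 0), (0, 1)), ((1, 0), (0, 1))}
bad_edges = {((0, 1), (1, 0))}
print(rule_violations(old, {(0, 1)}, ok_edges))
print(rule_violations(old, {(0, 1)}, bad_edges))
```

A supergraph built in several steps is a potential supergraph in the sense above if every step passes this check.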

Our transformation procedure generally involves changing part of the
recovery line of a graph $G_i$ to obtain the recovery line of another
graph $G_j$. The following lemma will be used throughout this paper to
ensure that the unchanged part, which forms an antichain in $G_i$,
remains an antichain in $G_j$ after the transformation.

⁴ It is not hard to see that $c_{0,2}$ being a garbage checkpoint will
not be affected by the occurrence of any recovery session because
every rollback either preserves the “triangular” condition in Fig. 4
for $c_{0,2}$ or simply invalidates $c_{0,2}$.

Lemma 2: Given a checkpoint graph $G = (V, E)$ and its potential
supergraph $G' = (V', E') \in \mathcal{G}_t(G)$, for any
$A \subseteq V$, $A$ is an antichain in $G$ if and only if $A$ is an
antichain in $G'$.

One special potential supergraph of $G$, denoted by $\hat{G}$, will
play a major role throughout this paper. The graph $\hat{G}$ is
constructed by adjoining a new vertex $n_i$ at the end of $G$ for each
$p_i$, with a single incoming edge from the last vertex $l_i$, as
shown in Fig. 5. Let $L$ denote the set of all last-nodes $l_i$ and
$B$ denote the set of all new-nodes $n_i$. We will refer to the $2^N$
graphs $\hat{G} - W$, $W \subseteq B$, as the immediate supergraphs of
$G$. The proof of the following property defines the recovery line
transformation within an operational session:

given the recovery line of a potential supergraph $G'$ of $G$, by
replacing its constituent checkpoints which are not contained in $G$
with their corresponding new-nodes of $\hat{G}$, we obtain the
recovery line of an immediate supergraph of $G$.

Fig. 5. Construction of the potential supergraph $\hat{G}$.
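
The construction of $\hat{G}$ and the enumeration of its $2^N$ immediate supergraphs can be sketched as below. The helper names and the edge-set encoding are assumptions for illustration; the new-node $n_i$ is represented as the pair `(i, last + 1)`, where `last` is the index of $l_i$.

```python
from itertools import combinations


def build_g_hat(last_index, edges):
    """last_index[i] is the index of l_i, the last checkpoint of p_i in G.
    Returns (new_nodes, edges of G-hat)."""
    new_nodes = {i: (i, last + 1) for i, last in last_index.items()}
    hat_edges = set(edges) | {((i, last), new_nodes[i])
                              for i, last in last_index.items()}
    return new_nodes, hat_edges


def immediate_supergraphs(last_index, edges):
    """Yield (W, edges of G-hat minus W) for every subset W of the new-nodes."""
    new_nodes, hat_edges = build_g_hat(last_index, edges)
    b = sorted(new_nodes.values())
    for k in range(len(b) + 1):
        for w in combinations(b, k):
            removed = set(w)
            # Each new-node only has an incoming edge, so dropping edges
            # that end in a removed node removes the node completely.
            yield removed, {(u, v) for u, v in hat_edges if v not in removed}


graphs = list(immediate_supergraphs({0: 1, 1: 0}, {((0, 0), (0, 1))}))
print(len(graphs))
```

For two processes the enumeration yields four graphs, from $\hat{G}$ itself (with $W$ empty) down to $G$ (with $W = B$).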

Property 1: For any checkpoint $v$ in a checkpoint graph $G$, if $v$
belongs to the recovery line of a potential supergraph $G'$, then $v$
must also belong to the recovery line of an immediate supergraph of
$G$. That is, given $G = (V, E)$, $v \in V$ and
$G' \in \mathcal{G}_t(G)$, if $v \in M^*(G')$, then
$v \in M^*(\hat{G} - W)$ for some $W \subseteq B$.

Proof. We partition $M^*(G')$ into $M_1 \cup M_2$ where
$M_1 = M^*(G') \cap V$ and $M_2 = M^*(G') \setminus V$, as shown in
Fig. 6. A corresponding partition of the new-nodes of $\hat{G}$ is
given as $B = B_1 \cup B_2$ such that
$B_1 = \{n_i : M^*(G')[i] \in M_1\}$ and
$B_2 = \{n_j : M^*(G')[j] \in M_2\}$. Our goal is to show that
$M^*(\hat{G} - B_1) = M_1 \cup B_2$. Then, for any $v \in V$ and
$v \in M^*(G')$, we must have $v \in M_1 \subseteq M^*(\hat{G} - W)$
where $W = B_1 \subseteq B$.

Fig. 6. Recovery line transformation within an operational session.

First we show that $M_1 \cup B_2 \in \mathcal{M}(\hat{G} - B_1)$.
Define the subset $L_2$ of last-nodes corresponding to $M_2$ as
$L_2 = \{l_j : M^*(G')[j] \in M_2\}$. Because $M_1 \cup M_2$ forms an
antichain in $G'$, we must have $M^*(G')[i] \not\prec l_j$ for any
$M^*(G')[i] \in M_1$ and $l_j \in L_2$. Now consider
$\hat{G} - B_1$. We have $M^*(G')[i] \not\prec n_j$ for any
$n_j \in B_2$ because each $n_j$ has only a single incoming edge from
$l_j$. Clearly, any new-node $n_j \not\prec M^*(G')[i]$. Lemma 2
further guarantees that $M_1$ $(\subseteq V)$ remains an antichain in
$\hat{G}$ and also in $\hat{G} - B_1$. Hence, we have
$M_1 \cup B_2 \in \mathcal{M}(\hat{G} - B_1)$.

We next prove that $M_1 \cup B_2 = M^*(\hat{G} - B_1)$ by
contradiction. Suppose $M_1 \cup B_2 \ne M^*(\hat{G} - B_1)$. There
must exist $M_1' = M^*(\hat{G} - B_1) \setminus B_2$ such that
$M_1' \subseteq V$, $M_1 \preceq M_1'$ and $M_1 \ne M_1'$, as shown in
Fig. 6. Now consider $G'$. Recall that $M_1$ and $M_2$ form an
antichain in $G'$ and thus for any $u \in M_1'$ and
$M^*(G')[j] \in M_2$, we must have $u \not\prec M^*(G')[j]$. We also
have $M^*(G')[j] \not\prec u$ by Rule 1.b. Therefore,
$M_1' \cup M_2$ forms an antichain in $G'$, contradicting the fact
that $M_1 \cup M_2$ is the maximal maximum-sized antichain of $G'$. □

The transformation within an operational session can be viewed as
“projecting” a potential supergraph along the direction opposite to
the time axis. It shows that although the number of potential
supergraphs of $G$ is infinite, the recovery lines of these graphs can
intersect $G$ in only a finite number of ways, and each of the
possible intersections must be part of the recovery line of an
immediate supergraph of $G$.


Transformation across a recovery session

Existing vertices on a checkpoint graph (see, for example, Fig. 1(b))
can be deleted due to rollback recovery. Let $G_E$ denote the extended
checkpoint graph as defined in Section II, $G = (V, E)$ denote the
subgraph of $G_E$ without the virtual checkpoints, and
$G^- = (V^-, E^-)$ denote the checkpoint graph immediately after
recovery. Figure 7 illustrates these checkpoint graphs. Let $F$ denote
the part of $G$ deleted by the rollback; then we have $G^- = G - F$.
By definition, $M^*(G_E)$ is the local recovery line. Let
$M^*(G_E) = M_r \cup M_v$ as shown in Fig. 7(a), where
$M_r = M^*(G_E) \cap V$ consists of real checkpoints and
$M_v = M^*(G_E) \setminus V$ consists of virtual checkpoints.
According to the rollback propagation algorithm, the following two
rules must be satisfied when existing vertices are deleted during
recovery.

Rule 2.a: There cannot exist any $u \in M_r$ and $w \in V^-$ such that
$u \prec w$, i.e., none of the checkpoints in $M_r$ can have any
outgoing edge in $G^-$;

Rule 2.b: For any $u$ in $F$, all of the checkpoints reachable from
$u$ must also be in $F$. Consequently, none of the checkpoints in $F$
can have any outgoing edge to any checkpoints in $G^-$.

Fig. 7. Checkpoint graphs before ($G$), during ($G_E$) and after
($G^-$) recovery.

Property 2 can be proved [14] by defining the recovery line
transformation across a recovery session as follows:

given the recovery line $M$ of an immediate supergraph of $G^-$, for
any $i$ such that $M[i]$ is a new-node and $M^*(G_E)[i]$ is not a
virtual checkpoint, we replace $M[i]$ with $M^*(G_E)[i]$ to obtain the
recovery line of an immediate supergraph of $G$.

Property 2: For any checkpoint $v$ in $G^-$, if $v$ belongs to the
recovery line of an immediate supergraph of $G^-$, then $v$ must also
belong to the recovery line of an immediate supergraph of $G$. That
is, given $G^- = (V^-, E^-)$ and $v \in V^-$, if
$v \in M^*(\hat{G}^- - W^-)$ for some $W^- \subseteq B^-$, then
$v \in M^*(\hat{G} - W)$ for some $W \subseteq B$, where $\hat{G}^-$,
$W^-$ and $B^-$ are defined for $G^-$ in parallel with the definitions
of $\hat{G}$, $W$ and $B$ for $G$, respectively.

Complete transformation

We now apply Properties 1 and 2 to transforming an arbitrary future
recovery line containing any nongarbage checkpoints. By repeatedly
applying Property 1 within every operational session and Property 2
across every recovery session, we demonstrate that every such future
recovery line of $G$ can be transformed into the recovery line of an
immediate supergraph of $G$ which preserves all of those nongarbage
checkpoints.

Property 3: [Transformation property] If a checkpoint in $G$ belongs
to a future recovery line, then it must also belong to the recovery
line of an immediate supergraph of $G$. That is, given $G = (V, E)$
and $v \in V$, if $v \in M^*(G')$ for a future checkpoint graph $G'$,
then $v \in M^*(\hat{G} - W)$ for some $W \subseteq B$.

Proof. Without loss of generality, we may assume $G$ is in the $q$th
operational session and $G'$ belongs to the $r$th session where
$r > q$. Let $G_i$ denote the checkpoint graph at the end of the $i$th
operational session, $G_i^-$ denote the checkpoint graph at the
beginning of the same session, and $W_i$ denote a subset of new-nodes
of $\hat{G}_i$. Clearly, $v$ must belong to every such intermediate
graph. By applying Property 1 to the graph pairs $(G', G_r^-)$,
$(\hat{G}_i - W_i, G_i^-)$ where $q+1 \le i \le r-1$ and
$(\hat{G}_q - W_q, G)$, and applying Property 2 to the graph pairs
$(G_i^-, G_{i-1})$ where $q+1 \le i \le r$, we can show that $v$ must
always remain on the recovery line of an immediate supergraph of one
of the intermediate graphs throughout the transformation procedure.
Eventually, we have $v \in M^*(\hat{G} - W)$ for some
$W \subseteq B$. □

Figure 8 gives an example demonstrating the recovery line
transformation. Figure 8(a) is the current checkpoint graph $G$
considered for garbage collection. Suppose that Fig. 8(b) is the
extended checkpoint graph when $p_3$ initiates a rollback; then
Fig. 8(c) is the checkpoint graph immediately after the recovery.
Figure 8(d) shows another possible extended checkpoint graph when
$p_0$ initiates a second rollback. Since checkpoints A and B are
needed for recovery in this case, they should be considered nongarbage
checkpoints of $G$. We first apply Property 1 to the graph pair
$(G_d, G_c)$ and transform the recovery line of $G_d$ into the
recovery line of $G_g$ (an immediate supergraph of $G_c$) by replacing
X, Y and Z with their corresponding new-nodes of $\hat{G}_c$, namely,
P, Q and R, respectively. Then we apply Property 2 to the pair
$(G_c, G_b)$. Since $p_3$ and $p_4$ contribute real checkpoints C and
D, respectively, to the local recovery line in Fig. 8(b), the recovery
line of $G_g$ is transformed into the recovery line of $G_f$ (an
immediate supergraph of $G_b$) by replacing Q and R with C and D.
Finally, by applying Property 1 to the pair $(G_f, G)$, we obtain the
recovery line of $G_e$ (an immediate supergraph of $G$) which still
contains the nongarbage checkpoints A and B.

Fig. 8. Example recovery line transformation.

C.  Recovery  Line  Decomposition 

Property 3 states that the recovery lines of the $2^N$ immediate
supergraphs of $G$ contain all nongarbage checkpoints. We next show
that there exists a set of $N$ recovery lines which forms a “basis”
for the $2^N$ recovery lines: each of the $2^N$ recovery lines is the
set of minimal elements in the union of a subset of the $N$ basis
recovery lines. Therefore, it suffices to find these $N$ recovery
lines to identify all nongarbage checkpoints.

Let $X \wedge Y$ denote the meet (greatest lower bound) of $X$ and $Y$
in a lattice and $\min(S)$ denote the set of minimal elements in $S$.
Based on the following property from Anderson’s book [20]: for any
poset $Q$ and $M_1, M_2 \in \mathcal{M}(Q)$,
$M_1 \wedge M_2 = \min(M_1 \cup M_2)$, we can show by induction [14]
that the greatest lower bound of any $k$ maximum-sized antichains can
be obtained as the set of minimal elements in their union.

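Lemma 3: Given a poset $P$, $M \in \mathcal{M}(P)$ and
$M \preceq M_i \in \mathcal{M}(P)$ for $0 \le i \le k-1$ for any
finite $k$, define $\bigwedge_{0 \le i \le k-1} M_i =
(\ldots((M_0 \wedge M_1) \wedge M_2) \ldots) \wedge M_{k-1}$. Then

(a) $M \preceq \bigwedge_{0 \le i \le k-1} M_i \in \mathcal{M}(P)$;

(b) $\bigwedge_{0 \le i \le k-1} M_i =
\min(\bigcup_{0 \le i \le k-1} M_i)$.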
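
Lemma 3(b) says the meet of several maximum-sized antichains is the set of minimal elements of their union, which is simple to compute once reachability is available. The sketch below is illustrative: the function names and the adjacency-map encoding `succ` are assumptions, and the inputs are assumed to be maximum-sized antichains of the same poset.

```python
def precedes(succ, u, v):
    """True if u happened before v, i.e. a path of length >= 1 leads from u to v."""
    stack, seen = list(succ.get(u, ())), set()
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(succ.get(w, ()))
    return False


def minimal_elements(succ, s):
    """min(S): the members of S that have no predecessor inside S."""
    return {v for v in s if not any(precedes(succ, u, v) for u in s if u != v)}


def meet(succ, antichains):
    """Greatest lower bound of maximum-sized antichains, as in Lemma 3(b)."""
    return minimal_elements(succ, set().union(*antichains))


# Two processes with no messages: every pair (c_{0,x}, c_{1,y}) is an antichain.
succ = {(0, 0): {(0, 1)}, (1, 0): {(1, 1)}}
m1 = {(0, 1), (1, 0)}
m2 = {(0, 0), (1, 1)}
print(sorted(meet(succ, [m1, m2])))
```

In this small example the meet takes the earlier checkpoint of each process, which matches the componentwise ordering defined in Lemma 1(b).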

The following lemma, which states the relationship between the
maximum-sized antichains of $G$ and those of its potential
supergraphs, is also required for proving the decomposition property.

Lemma 4: Given a checkpoint graph $G = (V, E)$ and its potential
supergraph $G' = (V', E')$, for any $M \subseteq V$,

(a) $M \in \mathcal{M}(G)$ if and only if $M \in \mathcal{M}(G')$;

(b) $M^*(G) \preceq M^*(G')$;

(c) if $M = M^*(G')$ then $M = M^*(G)$.

Property 4: [Decomposition property] For every $W \subseteq B$ and
$W \ne \emptyset$,
$M^*(\hat{G} - W) = \min(\bigcup_{n_i \in W} M^*(\hat{G} - n_i))$.


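Proof. Without loss of generality, let $W = \{n_i : 0 \le i \le k-1\}$
where $1 \le k \le N$. Since
$\hat{G} - n_j \in \mathcal{G}_t(\hat{G} - W)$ for all
$0 \le j \le k-1$, $M^*(\hat{G} - W) \preceq M^*(\hat{G} - n_j)$ by
Lemma 4(b).

Now consider the graph $\hat{G}$. From Lemma 4(a), we have
$M^*(\hat{G} - W) \in \mathcal{M}(\hat{G})$ and
$M^*(\hat{G} - n_j) \in \mathcal{M}(\hat{G})$ for all
$0 \le j \le k-1$. Let
$M_1 = \min(\bigcup_{0 \le j \le k-1} M^*(\hat{G} - n_j))$. From
Lemma 3, we have

$$M^*(\hat{G} - W) \preceq \bigwedge_{0 \le j \le k-1} M^*(\hat{G} - n_j)$$

Since $M^*(\hat{G} - n_j)[j] \prec n_j$ and thus $n_j \notin M_1$ for
all $0 \le j \le k-1$, every $x \in M_1$ must be contained in
$\hat{G} - W$. From Lemma 4(a), we have
$M_1 \in \mathcal{M}(\hat{G} - W)$ and hence
$M_1 \preceq M^*(\hat{G} - W)$. Therefore, we have proved that
$M^*(\hat{G} - W) = M_1 =
\min(\bigcup_{n_i \in W} M^*(\hat{G} - n_i))$. □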

As an example, we demonstrate the decomposition of $M^*(G_e)$ in
Fig. 8(e) where $G_e = \hat{G} - \{n_0, n_1, n_3, n_4\}$. From
Property 4 and referring to Fig. 9, we have

$$M^*(G_e) = \min\Big(\bigcup_{n_i \in \{n_0, n_1, n_3, n_4\}} M^*(\hat{G} - n_i)\Big)$$
$$= \min(\{A, B, n_2, n_3, n_4, n_0, I, n_1, J, C, D\})$$
$$= \{A, B, n_2, C, D\}$$

which is exactly the recovery line shown in Fig. 8(e).

Fig. 9. Example of the PCSR algorithm. Shaded checkpoints in (a)-(e)
belong to the recovery lines and the nonshaded checkpoints in (f) are
garbage.

D. Predictive Checkpoint Space Reclamation Algorithm
We are now prepared to derive a necessary and sufficient
condition for identifying all nongarbage checkpoints.

Theorem 1: A checkpoint $v$ in a checkpoint graph $G$ is nongarbage
if and only if $v \in M^*(\hat{G} - n_i)$ for some $0 \le i \le N-1$.

Proof. If $v \in M^*(\hat{G} - n_i)$ for some $0 \le i \le N-1$, then $v$ is
nongarbage because $\hat{G} - n_i$ is a possible future checkpoint graph
of $G$. Conversely, if $v$ is nongarbage, we have by definition $v \in
M^*(G')$ for some future checkpoint graph $G'$. From Property
3, $v \in M^*(\hat{G} - W)$ for some $W \subseteq B$; from Property 4,

$v \in \min\left(\bigcup_{n_i \in W} M^*(\hat{G} - n_i)\right)
\subseteq \bigcup_{n_i \in W} M^*(\hat{G} - n_i) \subseteq \bigcup_{0 \le i \le N-1} M^*(\hat{G} - n_i)$.

Therefore, $v \in M^*(\hat{G} - n_i)$ for some $0 \le i \le N-1$. □

Based on Theorem 1 we now present the Predictive Checkpoint
Space Reclamation (PCSR) algorithm in Fig. 10 for finding
the N recovery lines. Since the rollback propagation algorithm
in Fig. 2 is of time complexity $O(|E|)$ where $|E|$ is the total
number of edges in the checkpoint graph (as every edge visited
can be deleted), the PCSR algorithm is of time complexity
$O(N|E|)$.

Fig. 10. The Predictive Checkpoint Space Reclamation algorithm.

An example illustrating the execution of the PCSR algorithm
on the checkpoint graph G in Fig. 5 is shown in Fig. 9. All
of the checkpoints in G are nonobsolete and must be retained
according to the traditional algorithm. Our PCSR algorithm,
however, determines that all of the nonshaded checkpoints in
Fig. 9(f) can be discarded.

IV. Least Upper Bound on Number of Nongarbage Checkpoints

Theorem 1 not only identifies the minimum set of nongarbage
checkpoints but also places an upper bound of $N^2$ on the number
of nongarbage checkpoints because each $M^*(\hat{G} - n_i)$, $0 \le i \le
N-1$, consists of $N$ checkpoints. The following property
identifies the inherent relations among the $M^*(\hat{G} - n_i)$ and is the key
to further improving this upper bound to the least upper
bound $N(N+1)/2$.

Property 5: For any $0 \le i, j \le N-1$ and $i \neq j$, if
$M^*(\hat{G} - n_i)[j] \neq n_j$ and $M^*(\hat{G} - n_j)[i] \neq n_i$ then
$M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j)$.

We are now prepared to prove the second major result of this
paper.

Theorem 2: Let $N_g(G)$ denote the set of nongarbage checkpoints
of $G$ and $N$ be the number of processes. Then, $|N_g(G)| \le
N(N+1)/2$.

Proof. By Theorem 1, we have to consider only the $N^2$ vertices
$M^*(\hat{G} - n_i)[j]$, $0 \le i, j \le N-1$. First, $M^*(\hat{G} - n_i)[i]$ for
all $0 \le i \le N-1$ must be in $G$ and must contribute $N$ vertices
to $N_g(G)$. For the remaining $N^2 - N$ vertices with $i \neq j$, we
consider the pair $M^*(\hat{G} - n_i)[j]$ and $M^*(\hat{G} - n_j)[i]$ one at a
time and there are $(N^2 - N)/2$ such pairs. We distinguish three
cases:

Case 1: $M^*(\hat{G} - n_i)[j] = n_j$ and $M^*(\hat{G} - n_j)[i] = n_i$. Neither
new-node belongs to $N_g(G)$.

Case 2: $M^*(\hat{G} - n_i)[j] = n_j$ and $M^*(\hat{G} - n_j)[i] \neq n_i$, or
$M^*(\hat{G} - n_i)[j] \neq n_j$ and $M^*(\hat{G} - n_j)[i] = n_i$. This pair will
possibly add one new checkpoint to $N_g(G)$.

Case 3: $M^*(\hat{G} - n_i)[j] \neq n_j$ and $M^*(\hat{G} - n_j)[i] \neq n_i$. It
follows from Property 5 that $M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j)$, and
thus $M^*(\hat{G} - n_i)[j] = M^*(\hat{G} - n_j)[j]$ and $M^*(\hat{G} - n_j)[i] =
M^*(\hat{G} - n_i)[i]$. Since $M^*(\hat{G} - n_j)[j]$ and $M^*(\hat{G} - n_i)[i]$ are
already in $N_g(G)$, this case does not contribute any new
nongarbage checkpoint.

Therefore, each of the $(N^2 - N)/2$ pairs can contribute at
most one new checkpoint to $N_g(G)$ and hence $|N_g(G)| \le N +
(N^2 - N)/2 \times 1 = N(N+1)/2$. □
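
The counting step at the end of the proof can be written out term by term. This is only a restatement of the arithmetic above: the $N$ diagonal entries ($i = j$) plus at most one new checkpoint for each of the $(N^2 - N)/2$ off-diagonal pairs.

$$|N_g(G)| \;\le\; N + \frac{N^2 - N}{2} \times 1 \;=\; \frac{2N + N^2 - N}{2} \;=\; \frac{N(N+1)}{2}.$$

For instance, $N = 6$ gives a bound of 21 and $N = 8$ gives a bound of 36, compared with $N^2 = 36$ and $N^2 = 64$ from Theorem 1 alone.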

We next show that $N(N+1)/2$ is in fact the least upper
bound because for any $N$ we can construct a checkpoint graph
$G_N$ as shown in Fig. 11 to achieve this upper bound. Figure 11
shows the nongarbage checkpoints contributed by each of the
$N$ recovery lines in the PCSR algorithm. All of the $N(N+1)/2$
checkpoints are identified as nongarbage checkpoints.

Fig. 11. $G_N$: The checkpoint graph with $N(N+1)/2$
nongarbage checkpoints.

As a final note, the greatest lower bound of $N$ is achieved
when none of the $(N^2 - N)/2$ pairs contributes any nongarbage
checkpoint. Coordinated checkpointing protocols [6] guarantee
that, immediately after a checkpointing session, the last-node
of every process must be a maximal element; as a result, Case 1
holds for all pairs, thereby achieving the greatest lower bound.

V. Trace-Driven Simulation Results

Four parallel programs are used to illustrate the checkpoint
space reclamation capabilities and benefits of the PCSR
algorithm. Two of them are CAD programs written for the Intel iPSC/2
hypercube: Cell Placement and Channel Router; the other two
are Knight Tour and N-Queen written in the Chare Kernel
language, which has been developed as a message-driven
machine-independent parallel language [21]. We use the Encore
Multimax 510 multiprocessor version of the Chare Kernel.
Communication traces are collected for these four programs, and
trace-driven simulation is performed to obtain the results. The
checkpoint interval for each program is arbitrarily chosen to be
approximately ten percent of the total execution time, as shown
in Table 1.

Table 1. Execution and checkpoint parameters of the programs.

Figure 12 compares our PCSR algorithm with the traditional
algorithm for typical executions of the four programs. Each
curve shows the number of checkpoints which would be retained
if the algorithm were invoked after a certain number of checkpoints
had been taken. The domino effect is illustrated by the linear
increase in the number of nonobsolete checkpoints as the total
number of checkpoints increases. The largest difference between
the number of nonobsolete checkpoints and the number of
nongarbage checkpoints for each program is 39 versus 7 for Cell
Placement, 48 versus 12 for Channel Router, 24 versus 10 for
Knight Tour and 41 versus S for N-Queen.

Fig. 12. Nonobsolete versus nongarbage checkpoints for the
four parallel programs.

VI.  Summary 

We have derived a necessary and sufficient condition for
identifying all garbage checkpoints in an uncoordinated
checkpointing protocol. We proved that there exists a set of $N$ recovery
lines, where $N$ is the number of processes, such that any
checkpoint useful for a possible future recovery must be contained
in one of the $N$ recovery lines. An optimal checkpoint space
reclamation algorithm of time complexity $O(N|E|)$, where $|E|$
is the number of edges in the checkpoint graph, was presented
to identify all nongarbage checkpoints; the storage space for
the remaining checkpoints can then be reclaimed. In addition,
we demonstrated that the least upper bound on the number of
nongarbage checkpoints is $N(N+1)/2$. Communication
trace-driven simulation for four parallel programs demonstrated that
the algorithm can be effective in significantly reducing the
number of retained checkpoints.
NUMBERED FOOTNOTES:

(1) Extension of our work to an asynchronous logging protocol is considered elsewhere [4].

(2) The global recovery line is to be used when the entire system fails, while a local recovery line
is computed when only a subset of processes fail.

(3)  It  has  been  pointed  out  [18]  that  the  poset  R'  corresponding  to  the  transitive  closure  of  the 
checkpoint  graph  based  on  direct  dependency  tracking  is  not  exactly  the  same  as  R.  However,  R' 
and  R  possess  the  same  set  of  maximum-sized  antichains. 

(4)  It  is  not  hard  to  see  that  co,2  being  a  garbage  checkpoint  will  not  be  affected  by  the  occurrence 
of  any  recovery  session  because  every  rollback  either  preserves  the  “triangular”  condition  in  Fig.  4 
for  Co, 3  or  simply  invalidates  co.s. 


Fig. 1. Checkpointing and rollback recovery. (a) Example checkpoint and communication pattern;
(b) checkpoint graph and extended checkpoint graph when $p_0$ initiates a rollback.

Fig.  2.  The  rollback  propagation  algorithm. 

Fig.  3.  Operational  sessions,  recovery  sessions  and  nongarbage  checkpoints. 

Fig.  4.  Example  of  nonobsolete  garbage  checkpoints. 

Fig. 5. Construction of the potential supergraph $\hat{G}$.

Fig.  6.  Recovery  line  transformation  within  an  operational  session. 

Fig.  7.  Checkpoint  graphs  before  (G),  during  (Gg)  and  after  (G“)  recovery. 

Fig.  8.  Example  recovery  line  transformation. 

Fig. 9. Example of the PCSR algorithm. Shaded checkpoints in (a)-(e) belong to the recovery
lines and the nonshaded checkpoints in (f) are garbage.

Fig.  10.  The  Predictive  Checkpoint  Space  Reclamation  algorithm. 

Fig. 11. $G_N$: The checkpoint graph with $N(N+1)/2$ nongarbage checkpoints.

Fig.  12.  Nonobsolete  versus  nongarbage  checkpoints  for  the  four  parallel  programs. 


Table 1. Execution and checkpoint parameters of the programs.
/* CP stands for checkpoint */
/* Initially, all of the CPs are unmarked */

include the latest CP of each process in the root set;
mark all CPs strictly reachable from any CP in the root set;
while (at least one CP in the root set is marked) {
    replace each marked CP in the root set by the latest unmarked CP of the same process;
    mark all CPs strictly reachable from any CP in the root set;
}
the root set is the recovery line.
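
The marking loop of Fig. 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `recovery_line`, `checkpoints` and `edges` are hypothetical, the graph is passed in as plain dictionaries, and every process is assumed to have at least one checkpoint that is never marked (for example, an initial checkpoint with no incoming edges). Each node is expanded at most once, so each edge is followed at most once, in the spirit of the $O(|E|)$ argument.

```python
def recovery_line(checkpoints, edges):
    """Return {process: checkpoint} for the recovery line.

    checkpoints: dict mapping each process to its checkpoints, oldest first.
    edges: dict mapping a checkpoint to the checkpoints it points to
           (assumed representation of the checkpoint graph).
    """
    marked = set()     # CPs strictly reachable from some root seen so far
    expanded = set()   # CPs whose outgoing edges were already followed

    def mark_from(starts):
        # Depth-first marking; the start nodes themselves are not marked.
        stack = list(starts)
        while stack:
            v = stack.pop()
            if v in expanded:
                continue
            expanded.add(v)
            for w in edges.get(v, ()):
                marked.add(w)
                stack.append(w)

    # Root set: the latest CP of each process.
    root = {p: cps[-1] for p, cps in checkpoints.items()}
    mark_from(root.values())

    while any(c in marked for c in root.values()):
        for p, cps in checkpoints.items():
            if root[p] in marked:
                unmarked = [c for c in cps if c not in marked]
                if not unmarked:
                    raise ValueError(f"process {p!r} has no unmarked checkpoint")
                root[p] = unmarked[-1]   # latest unmarked CP of the same process
        mark_from(root.values())
    return root


# Made-up two-process example (not taken from the paper's figures).
example_checkpoints = {0: ["c00", "c01", "c02"], 1: ["c10", "c11", "c12"]}
example_edges = {
    "c00": ["c01"], "c01": ["c02"],         # same-process successor edges
    "c10": ["c11"], "c11": ["c12", "c02"],  # c11 -> c02 is a message edge
    "c02": ["c12"],                         # c02 -> c12 is a message edge
}
example_line = recovery_line(example_checkpoints, example_edges)
```

In this made-up graph the rollback cascades once in each direction, and the loop settles on the second checkpoint of each process.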


[Fig. 3 diagram residue: a timeline labeled with operational sessions and recovery sessions.]
/* $N_g(G)$ denotes the set of nongarbage checkpoints of $G$ */
/* $N$ is the number of processes */
/* $\hat{G}$ and $n_i$ are as defined in Fig. 5 */

for each $0 \le i \le N-1$ {
    apply the rollback propagation algorithm in Fig. 2 to the checkpoint graph $\hat{G} - n_i$
    to find the recovery line;
    all checkpoints in the recovery line except for the new-nodes are included in the set $N_g(G)$;
}
all of the checkpoints not in $N_g(G)$ can be garbage-collected.
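
The loop of Fig. 10 can be sketched in Python on top of a rollback-propagation routine. This is a hedged illustration, not the authors' code: `pcsr` and `recovery_line` are hypothetical names, the potential supergraph is assumed to be built by the caller (its construction in Fig. 5 is not reproduced here), and the new-node of each process is assumed to be the last entry of that process's checkpoint list.

```python
def recovery_line(checkpoints, edges):
    """Rollback propagation as in Fig. 2 (compact sketch)."""
    marked, expanded = set(), set()

    def mark_from(starts):
        stack = list(starts)
        while stack:
            v = stack.pop()
            if v in expanded:
                continue
            expanded.add(v)
            for w in edges.get(v, ()):
                marked.add(w)
                stack.append(w)

    root = {p: cps[-1] for p, cps in checkpoints.items()}
    mark_from(root.values())
    while any(c in marked for c in root.values()):
        for p, cps in checkpoints.items():
            if root[p] in marked:
                # Assumes each process keeps at least one unmarked CP.
                root[p] = [c for c in cps if c not in marked][-1]
        mark_from(root.values())
    return root


def pcsr(checkpoints, edges):
    """Return the set of nongarbage checkpoints.

    checkpoints[p] lists process p's nodes in the supergraph, oldest first;
    the last entry is assumed to be the new-node of process p.
    """
    new_nodes = {cps[-1] for cps in checkpoints.values()}
    nongarbage = set()
    for cps in checkpoints.values():
        removed = cps[-1]  # build the supergraph minus this one new-node
        sub_cps = {q: [c for c in lst if c != removed]
                   for q, lst in checkpoints.items()}
        sub_edges = {u: [w for w in ws if w != removed]
                     for u, ws in edges.items() if u != removed}
        line = recovery_line(sub_cps, sub_edges)
        nongarbage |= set(line.values()) - new_nodes
    return nongarbage


# Made-up supergraph for two processes; "n0" and "n1" play the new-nodes.
toy_checkpoints = {0: ["a0", "a1", "n0"], 1: ["b0", "b1", "n1"]}
toy_edges = {"a0": ["a1"], "a1": ["n0", "b1"], "b0": ["b1"], "b1": ["n1"]}
toy_nongarbage = pcsr(toy_checkpoints, toy_edges)
toy_garbage = {c for cps in toy_checkpoints.values() for c in cps} \
    - toy_nongarbage - {"n0", "n1"}
```

On this toy input three checkpoints are kept, which happens to equal $N(N+1)/2$ for $N = 2$, and only `a0` is reclaimable. The sketch calls the rollback routine once per process, mirroring the $O(N|E|)$ structure, although copying the graph for each iteration adds overhead that a careful implementation would avoid.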
Table 1. Execution and checkpoint parameters of the programs.

Benchmark programs         Cell Placement          Channel Router          Knight Tour      N-Queen
Number of processors       8                       8                       6                6
Machine                    Intel iPSC/2 hypercube  Intel iPSC/2 hypercube  Encore Multimax  Encore Multimax
Execution time (sec)       322.7                   469.3                   273.2            1625.1
Checkpoint interval (sec)  35                      40                      30               150


